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' Abstract. The estimation of probabilistic deformable template models in computer vision or of probabilistic 

atlases in Computational Anatomy are core issues in both fields. A first coherent statistical framework where the 
geometrical variability is modelled as a hidden random variable has been given by Allassonniere, Amit and Trouve in 
, [1]. They introduce a Bayesian approach and mixture of them to estimate deformable template models. A consistent 

stochastic algorithm has been introduced in [2] to face the problem encountered in [1] for the convergence of the 
estimation algorithm for the one component model in the presence of noise. We propose here to go on in this 
direction of using some "SAEM-like" algorithm to approximate the MAP estimator in the general Bayesian setting 
of mixture of deformable template models. We also prove the convergence of our algorithm toward a critical point 

VP , of the penalised likelihood of the observations and illustrate this with handwritten digit images and medical images. 
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C3 • 1. Introduction. The issue of representing and analysing some geometrical structures upon 

' which some deformations can act is a challenging question in applied mathematics as well as 

1—1 1 in Computational Anatomy. One central point is the modelisation of varying objects, and the 

quantification of this variability with respect to one or several reference models which will be 
called templates. This is known as "Deformable Templates" [12]. To our best knowledge, the 
t-H ' problem of constructing probabilistic models of variable shapes in order to statistically quantify 

C\| . this variability has not been successfully addressed yet in spite of its importance. For example, 

modelling the anatomical variability of organs around an ideal shape is of a crucial interest in the 
medical domain in order to find some characteristic differences between populations (pathological 
' and control) , or to exhibit some pathological kind of deformations or shapes of an organ. 

■ Many solutions have been proposed to face the problem of the template definition. They 

f~>) | go from some generalised Procruste's means with a variational [11] or statistical [10] point of 

• • . view to some statistical models like Active Appearance Model [6] or Minimum Description Length 

. £^ ' methods [16]. Unfortunately, all these methods only focus on the template whereas the geometrical 

variability is computed afterwards (using PC A). This contradicts with the fact that a metric is 
required to compute the template through the computation of deformations. Moreover, they do 
not really differ from the variational point of view since they consider the deformations as some 
nuisance parameters which have to be estimated and not as some unobserved random variables. 

The main goal of this paper is to propose a coherent estimation of both photometric model 
and geometrical distribution in a given population. Another issue addressed here is the clustering 
problem. Given a set of images, the statistical estimation of the component weights and of the 
image labels is usually supervised, at least the number of components is fixed. The templates of 
each component and the label are estimated iteratively (for example in methods like K-means) 
but the geometry, and related to this the metric used to compute the distances between elements, 
is still fixed. Moreover, the label, which is not observed, is, as the deformations, considered as 
a parameter and not as a hidden random variable. These methods do not lead to a statistical 
coherent framework for the understanding of deformable template estimation and all these iterative 
algorithms derived from those approaches do not have a statistical interpretation as the parameter 
optimisation of a generative model describing the data. 

In this paper we consider the statistical framework for dense deformable templates developed 
by Allassonniere, Amit and Trouve in [1] in the generalised case of mixture model for multicom- 
ponent estimation. Each image taken from a database is supposed to be generated from a noisy 
random deformation of a template image picked randomly among a given set of possible templates. 
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All the templates are assumed to be drawn from a common prior distribution on the template 
image space. To propose a generative model, each deformation and each image label have to be 
considered as hidden variables. The templates, the parameters of the deformation laws and the 
components weights are the parameters of interest. This generative model allows to automati- 
cally decompose the database into components and, at the same time, estimates the parameters 
corresponding to each component while increasing the likelihood of the observations. 

Given this parametric statistical Bayesian model, the parameter estimation is performed in 
[1] by a Maximum A Posteriori (MAP). The authors carry out this estimation problem using a 
deterministic and iterative scheme based on the EM (Expectation Maximisation) algorithm where 
the posterior distribution is approximated by a Dirac measure on its mode. Unfortunately, this 
gives an algorithm whose convergence toward the MAP estimator cannot be proved. Moreover, 
as shown by the experiments in that paper, the convergence is lost within a noisy setting. 

Our goal in this paper is to propose some stochastic iterative method to reach the MAP 
estimator for which we will be able to get a convergence result as already done for the one 
component case in [2]. We propose to use a stochastic version of the EM algorithm to reach 
the maximum of the posterior distribution. We use the Stochastic Approximation EM (SAEM) 
algorithm introduced by Delyon et al in [7] coupled with a Monte Carlo Markov Chain (MCMC) 
method. This coupling algorithm has been introduced by Kuhn and Lavielle in [14] in the case 
where the missing variables had a compact support. Contrary to the one component model 
where we can couple the iteration of the SAEM algorithm with the Markov chain evolution (cf. 
[2]), we show here that it cannot be driven numerically We need to consider an alternative 
method. We propose to simulate the hidden variables using some auxiliary Markov chains, one 
per component, to approach the posterior distribution. We prove the convergence of our algorithm 
for a non compact setting by adapting Delyon's theorem about general stochastic approximations 
and introducing truncation on random boundaries as in [5] . 

The paper is organised as follows: in Section 2 we first recall the observation mixture model 
proposed by Allassonniere, Amit and Trouve in [1]. In Section 3, we describe the stochastic algo- 
rithm used in our particular setting. Section 4 is devoted to the convergence theorem. Experiments 
on 2D real data sets are presented in Section 5. The proofs of the convergence of the algorithm 
are postponed in Section 6 whereas Conclusion and Discussion are given in Section 7. 

2. The observation model. We consider the framework of multicomponent model in- 
troduced in [1]. Given a sample of gray level images (yi)i<i< n observed on a grid of pixels 
{v u £ D C M. 2 ,u G A} where D is a continuous domain and A the pixel network, we are looking 
for some template images which explain the population. Each of these images is a real function 
Iq : M 2 — *• K defined on the whole plane. An observation y is supposed to be a discretisation on A 
of a deformation of one of the templates plus an independent additive noise. This leads to assume 
the existence of an unobserved deformation field z : M 2 — > K 2 such that for «eA: 

y(u) = Iq{v u - z(v u )) + e(u) , 

where e denotes an additive noise. 

2.1. Models for templates and deformations. We use the same framework as chosen in 
[1] to describe both the templates Iq and the deformation fields z. Our model takes into account 
two complementary sides: photometric -indexed by p- corresponding to the templates and the 
noise variances, and geometric -indexed by g- corresponding to the deformations. The templates 
Jo and the deformations z are assumed to belong to some finite dimensional subspaces of two 
reproducing kernels Hilbert spaces V p and V g (determined by their respective kernels K p and K g ). 
We choose a representation of both of them by finite linear combinations of the kernels centred 
at some fixed landmark points in the domain D: {Vp,j)i<j<k p respectively (i>g,j)i<j<k„' They are 
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therefore parametrised by the coefficients a € R fcp and (3 e (R fc s) 2 which yield: Vw £ £>, 



7 Q (i>) = (K p a)(v) =^2K p (v,v p j)oP , 
3=1 

kg 

2.2. Parametrical model. In this paper, we consider a mixture of the deformablc template 
models which allows a fixed number r m of components in each training set. This means that the 
data will be separated in r m (at most) different components by the algorithm. 

Therefore, for each observation yj, we consider the pair (/3i,Tj) of unobserved variables which 
correspond respectively to the deformation field and to the label of image i. We denote below by 
y* - (y{, ■ ■ ■ , Vn), by /3* = ($,... , (3D and by r* = (n, . . . , r„). The generative model is: 

Tm 

T ~ ®"=1 E Mt I (Pt)l<t<r m , 
t=l 

< |9 ~ ®" = i^V(o, r fl)Ti ) | r, (r ffit )i< t < Tm , (2 - x) 

y ~ ®iLiN'(zi3 i I aTi ,o-l i Id\ A .\) | A r, (ai,o- t 2 )i<t<T TO , 

where zpl a (u) — I a (v u — zp(v u )) is the action of the deformation on the template / a , for u 
in A and St is the Dirac function on t. The parameters of interest are the vectors (at)i<t< Tm 
coding the templates, the variances (c|)i<t<r m of the additive noises, the covariance matri- 
ces (r Sit )i< t < Tm of the deformation fields and the component weights (pt)i<t< Tm . We denote 
by (Ot, Pt)i<t<T m the parameters so that 8t corresponds to the parameters composed of the 
photometric part (at, of) and the geometric part r 9jt for component t. We assume that for 
all 1 < t < t to , the parameter 6 t = (at, erf, ^g,t) belongs to the open space defined as 
6 = { (a, (J 2 ,T g ) | a £ R kp , \a\ < R, a > 0, r g e Sym^ fcg ^(R) } , where R is an arbitrary positive 

constant and Sym^ fe t (R) is the set of strictly positive symmetric matrices. Concerning the weights 

(pt)i<t<r m , wc assume that they belong to the set g — < (pt)i<t<r m G]0, l[ Tm | ^2 pt = 1 

I ™ t=i 

Remark 1 . TTiis yields a generative model: given the parameters of the model, to get a real- 

isation of an image, we first draw a label t with respect to the probability law ^2 PtSt- Then, we 

t=i 

simulate a deformation field (3 using the covariance matrix corresponding to component r accord- 
ing to Af(0,T 'g. T ) . We apply it to the template of the T th component. Last, we add an independent 
Gaussian noise of variance a\ . 

We choose a normal distribution for the unobserved deformation variable because of the back- 
ground we have in image analysis. Indeed, the registration problem is an issue that has been 
studied deeply for the past two decades. The goal is, given two images, to find the best defor- 
mation that will match one image close to the other. Such methods require to choose the kind 
of deformations that are allowed (smooth, diffeomorphic, etc). These conditions arc equivalent, 
for some of these methods, to choose a covariance matrix that enables to define an inner product 
between two deformations coded by a vector (3 (cf. [18, 4]). The regularisation term of the match- 
ing energy in the small deformation framework treated in this paper can be written as : /J'! 1 " 1 /?. 
This looks like the logarithm of the density of a Gaussian distribution on [3 with mean and a 
covariance matrix T g . The link between these two points of view has been given in [1] ; the mode 
of the posterior distribution equals the solution of a general matching problem. This is why we 
therefore set on the deformation vector [3 such a distribution. Moreover, many experiments have 
been run using a large variety of such a matrix which gives us now a good initial guess for our 
parameter. This leads us to consider a Bayesian approach with a weakly informative prior. 
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2.3. The Bayesian approach. The information given by the image analysis background 
is here introduced mathematically in terms of prior laws on the parameters of model (2.1). As 
already mentioned in the previous paragraph, this background knowledge enables to determine a 
good initial guess for the laws and the values of the hyper-paramctcrs. As well as for the covariance 
matrix T g , the same arguments are true for the noise variance a 2 . In the registration viewpoint, 
this variance is the tradeoff between the deformation cost and the data attachment term that 
compose the energy to minimise. An empirical good initial guess is therefore known as well. 

On another hand, the high dimensionality of the parameters can lead to degenerated maximum 
likelihood estimator when the training sample is small. While introducing prior distributions, the 
estimation with small samples is still possible. The importance of these prior distributions in 
the estimation problem has been shown in [1]. The solution of the estimation equation can be 
interpreted as barycenters between the hyper-parameters of the priors and the empirical values. 
This ensures easy computations and other theoretical properties as for example, the invertibility of 
the covariance matrix T g . The role of the other hyper-parameters arc discussed in the experiments. 

We use a generative model which includes natural standard conjugate prior distributions with 
fixed hyper-paramctcrs. These distributions are an inverse- Wishart priors on each T gt and of and 
a normal prior on each at, for all 1 < t < r m . All priors are assumed independent. Then, 

v p (da,da 2 ) oc exp ( -\{a - p,pf {T, p )~ l {a - fj, p ) ) ( exp ( -7^% ) -4= ) da 2 da, a p > 3 , 



2 

u g (dT g )c (cxp(-(r- 1 ,I] 9 ) F /2) 



where (A, B)f = tr(A t B) is the scalar product of two matrices A and B and tr stands for the 
trace. 

For the prior law v p , we choose the Dirichlet distribution, T>(a p ), with density 




fp(p) oc I J^J p t J , with fixed parameter a p 



\t=i 



3. Parameter estimation using a stochastic version of the EM algorithm. For the 
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sake of simplicity, let us denote by N = 2nk g and by T = {1, . . . , r m }" so that the missing de 



formation variables take their values in K.^ and the missing labels in T. We also introduce the 
following notations: r\ = (6,p) with 9 = (0t)i<t<T m an d p = (pt) 1<t<Tm ■ 

In our Bayesian framework, we choose the MAP estimator to estimate the parameters: 



fj n = argmax , (3.1) 

v 

where qB {v\y) denotes the distribution of 77 conditionally to y. 

Remark 2 . Even if we are working in a Bayesian framework, we do not want to estimate the 
distributions of our parameters. Knowing the distribution of the template image and its possible 
deformations is not of great interest from an image analysis point of view. Indeed, people are more 
interested, in particular in the medical imaging community, in an atlas which characterises the 
populations of shapes that they consider rather than its distribution. Moreover, the distribution of 
the deformation law makes even less sense. This is the reason why we focus on the MAP. 

In practice, to reach this estimator, we maximise this posterior distribution using a Stochastic 
Approximation EM (SAEM) algorithm coupled with a Monte Carlo Markov Chain (MCMC) 
method. Indeed, due to the intractable computation of the E step of the EM algorithm introduced 
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by [8] encountered in this complex non linear setting, we follow a stochastic variation called SAEM 
proposed in [7]. However, again due to the expression of our model, the simulation required in 
this algorithm cannot be performed directly. Therefore, we propose to use some MCMC methods 
to reach this simulation as proposed by Kuhn and Laviclle in [14] and done for the one component 
model in [2] . Unfortunately, the direct generalisation of the algorithm presented in [2] paper turns 
out to be of no use in practice because of some trapping state problems (cf. Subsection 3.2.). This 
suggests to go back to some other extension of the SAEM procedure. 

3.1. The SAEM algorithm using MCMC methods. Let us first recall the SAEM algo- 
rithm. It generates a sequence of estimated parameters (r]k) k which converges towards a critical 
point of 7] logq(y,7y) under some mild assumptions (cf. [7]). These critical points coincide with 
the critical points of i] t— > log qs (r]\y) ■ The k th iteration consists in three steps: 
Simulation step : the missing data, here the deformation parameters and the labels, ((3,t), are 

drawn with respect to the distribution of (/3, t) conditionally to y denoted by 7r, ; , using 

the current parameter r)k-i 

G9 fcJ T fc )~7r % _ 1) (3.2) 

Stochastic approximation step : given (Ak)k a decreasing sequence of positive step-sizes, a 
stochastic approximation is done on the quantity logq(y, f3, t, rj), using the simulated 
value of the missing data : 

Qk(v) = Qk-i(v) + A fc [logq(y,/3 fc ,r fc ,77) - Q k -i(v)] > (3-3) 
Maximisation step : the parameters are updated in the M-step, 

T] k = argmaxQ fc (?7) . (3.4) 
v 

Initial values Qq and rjo are arbitrarily chosen. 

We notice that the density function of the model proposed in paragraphs 2.2 and 2.3 belongs 
to the curved exponential family. That is to say that the complete likelihood can be written as: 
q(y,/3,T,rj) = cxp [—7/1(77) + (S ((3 , t) , <j)(r]))] , where the sufficient statistic S is a Borcl function 
on R N PT taking its values in an open subset iS of R m and ip, <j> two Borel functions on OPg. 
(Note that S, 4> and tp may depend also on y, but since y will stay fixed in the sequel, we omit 
this dependency). Thanks to this property of our model, it is equivalent to do the stochastic 
approximation on the complete log-likelihood as well as on the sufficient statistics. This yields 
equation (3.3) to be replaced by the following stochastic approximation s of the sufficient statistics 
S : 

s k = s fc -i + A k (S(f3 k ,T k ) - a fc _i)P. (3.5) 

We now introduce the following function: L : SP&Pg —> R as L(s; rj) = —i/j(rf) + (s, 4>{r))) . It 
has been proved in [1] that there exists a critical function fj : S — ► QPg which is a zero of VL. It 
is straightforward to prove that this function satisfies: Vrj € 0Pg,Vs G S, L(s;fj(s)) > L{s\rj) so 
that the maximisation step (3.4) becomes: 

Vk = v{s k ) ■ 

Concerning the simulation step, in our model, the simulation of the missing variables with 
respect to the conditional distribution ir v cannot be carried out. Indeed, its probability density 
function (pdf) has a close form but rather complicated ; it does not correspond to some usual pdf. 
One solution proposed in [14] for such cases is to couple the SAEM algorithm with Monte Carlo 
Markov Chain (MCMC) method. However, we do not fit exactly into their requirements since 
the missing variable (3 does not have a compact support. We introduce an ergodic Markov chain 
whose stationary distribution is the conditional distribution 7r r; . We denote its transition kernel 
by n ?) . The simulation step (3.2) is thus replaced by the following step: 
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G9 fc ,r fc )~n w _ 1 ((/3 t _ 1) T fc _ 1 ),.). 



(3.6) 



The most common choice of kernel is an accept-reject step which is carried out through 
a Metropolis-Hastings algorithm. Unfortunately, in our particular setting, we deal with large 
dimensions for the missing variables. This made us move to some other kind of MCMC methods, 
like a Gibbs sampler, to simulate our missing variables. 

3.2. The transition of the MCMC method: a hybrid Gibbs sampler. If we consider 
the full vector (/3, r) as a single vector of missing data, we can use the hybrid Gibbs sampler on 
R W PT as follows. For any b £ R and 1 < j < N, let us denote by flb^j the unique configuration 
which is equal to (3 everywhere except the coordinate j where (3 J b ^j = b and by (3~ 3 the vector 
(3 without the coordinate j. Each coordinate of the deformation field (3 3 is updated using a 
Metropolis-Hastings step where the proposal is given by the conditional distribution of /3- , |/3~-', r 
coming from the current Gaussian distribution with the corresponding parameters (pointed by 
t) . Then, the last coordinates corresponding to the missing variable r are drawn with respect to 



Even if this procedure provides an estimated parameter sequence which would theoretically 
converge toward the MAP estimator, in practice, as mentioned in [19], it would take a quite long 
time to reach its limit because of the trapping state problem: when a small number of observations 
are assigned to a component, the estimation of the component parameters is hardly concentrated 
and the probability of changing the label of an image to this component or from this component 
to another is really small (most of the time under the computer precision). 

We can interpret this from an image analysis viewpoint: the first iteration of the algorithm 
gives a random label to the training set and computes the corresponding maximiser ?y = (0,p). 
Then, for each image, according to its current label, it simulates a deformation field which only 
takes into account the parameters of this given component. Indeed, the simulation of (3 through the 
Gibbs sampler involves a proposal whose corresponding Markov chain has q(/3|r, y, if) as stationary 
distribution. Therefore, the deformation tries to match y to the deformed template of the given 
component t. The deformation field tries to get a better connection between the component 
parameters and the observation, and there is only small probability that the observation given 
this deformation field will be closer to another component. The update of the label r is therefore 
conditional to this deformation which would not leave much chance to switch component. 

To overcome the trapping state problem, we will simulate the optimal label, using as many 
Markov chains in (3 as the number of components so that each component has a corresponding 
deformation which "computes" its distance to the observation. Then we can simulate the optimal 
deformation corresponding to that optimal label. 

Since we aim to simulate (/3, t) through a transition kernel that has q((3, T\y, rj) as stationary 
distribution, we simulate t with a kernel whose stationary distribution is q{T\y, rj) and then (3 
through a transition kernel that has q((3\T, y, 77) as stationary distribution. 

For the first step, we need to compute the weights q(t\yi, 77) ex q(t, yi\r]) for all 1 < t < t,„ and 
all 1 < i < n which cannot be easily reached. However, for any density function /, for any image 
Hi and for any 1 < t < T m , we have 



Obviously the computation of this expectation w.r.t. the posterior distribution is not tractable 
either but we can approximate it by a Monte Carlo sum. However, we cannot easily simulate 
variables through the posterior distribution q{-\yi, t, rj) as well, so we use some realisations of an 
ergodic Markov chain having q(-\yi,t, 77) as stationary distribution instead of some independent 
realisations of this distribution. 

The solution we propose is the following: suppose we are at the k th iteration of the algorithm 
and let r\ be the current parameters. Given any initial deformation field £0 € R 2fcfl , we run, for 



q(T\P,y,r))- 




(3.7) 
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each component t, the hybrid Gibbs sampler IL t on M 2k <' J times so that we get J elements £ t j = 
Kti)i<l<j of an ergodic homogeneous Markov chain whose stationary distribution is q(-\yi, t, rf). 
Let us denote by £j = (Ct,i)i<i<r m the matrix of all the auxiliary variables. We then use these 
elements for the computation of the weights pj(t\£i, yi, rj) through a Monte Carlo sum: 



(l 3 

pj(t\€i,Vi,y) « 



J 

1=1 



/($) 



(3- 



where the normalisation is done such that their sum over t equals one, involving the dependence on 
all the auxiliary variables £j. The ergodic theorem ensures the convergence of our approximation 

Ttn 

toward the expected value. We then simulate r through <8>™ =1 Pjif\^,iiVi^ r i)^t- 



t=i 



Concerning the second step, we update (3 by re-running J times the hybrid Gibbs sampler 
IT,^.,- on starting from a random initial point /3 in a compact subset of M. N . The size of J will 
depend on the iteration k of the SAEM algorithm in a sense that will be precised later, thus we 
now index it by k. 

The density function / involved in the Monte Carlo sum above needs to be specified to get 
the convergence result proved in the last section of this paper. We show that using the prior on 
the deformation field enables to get the sufficient conditions for convergence. This density is the 
Gaussian density function and depends on the component we are working with: 

MO = * — cx P . (3.9) 



^ a VW^t 

Algorithm 2 shows the detailed iteration. 

Remark 3. The use of one simulation of (3 per component is a point that was already used 
in [1] while computing the best matching, f3* , for all components by minimising the corresponding 
energies. This gives as many (3* as components for each image. Then, according to these best 
matchings, it computed the best component which therefore pointed the matching to consider. 

3.3. Truncation on random boundaries. Since our missing data have a non-compact 
support, some of the convergence assumptions of such algorithms [14] are not satisfied. This leads 
to consider a truncation algorithm as suggested in [7] and extended in [2]. 

Let {K,q)q>o be an increasing sequence of compact subsets of £ such as U q >olC q = S and IC q C 
int(/C ? +i), V<2 > 0. Let K be a compact subset of l w . Let II r; be a transition kernel of an ergodic 
Markov chain on R w having ir^ as stationary distribution. We construct the homogeneous Markov 
chain {{(3 k , T k , Sk, «fe))fe>o as explained in Algorithm 1. As long as the stochastic approximation 
docs not wander out the current compact set, we run our "SAEM-MCMC" algorithm. As soon 
as this previous condition is not satisfied, we reinitialise the sequences of s and (/3, t) using a 
projection (for more details see [7]). The current compact is then enlarge. To point toward the 
current compact, we use a counter sequence {Kk)k which remains unchanged when the previous 
condition is satisfied and increases to point toward a bigger compact when re-projecting. 

4. Convergence theorem of the multicomponent procedure. In this particular section 
the variances of the components (of )i<i< Tm are fixed. Alleviating this condition is not straight- 
forward and is an issue of our future work. 

To prove the convergence of our parameter estimate toward the MAP, we have to go back to 
a convergence theorem which deals with general stochastic approximations. Indeed, the SAEM- 
MCMC algorithm introduced and detailed above is a Robbins-Monro type stochastic approxima- 
tion procedure. One common tool to prove the w.p.l convergence of such a stochastic approxima- 
tion has been introduced by Kushncr and Clark in [15]. However, some of the assumptions they 
require are intractable with our procedure (in particular concerning the mean field defined below) . 
This leads us to slightly adapt the convergence theorem for stochastic approximations given in [7] . 

We consider the following Robbins-Monro stochastic approximation procedure: 
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Algorithm 1 Stochastic approximation with truncation on random boundaries 
Set /3 £ K, to e T, sq G /Co and ko = 0. 
for all k > 1 do 

compute s = s fc _i + A k (S(j3,t) - s fc _i) 

where (/3,t) are sampled from a transition kernel n %1 (see Algorithm 2). 

if s G /C Kfc _ 1 then 

set (s k ,f3 k ,T k ) = (s,/3,r) and K fe = 
else 

set (s kl p k ,T k ) = (J,j8,f) G /CqPKPT and Kfc = + 1 
and (s,/3) can be chosen through different ways (cf . [7]). 
end if 

T] k = argmax77(s fe ). 
v 

end for 



Algorithm 2 Transition step fc 



1 using a hybrid Gibbs sampler on (J3, t) 



Require: r] = r] k , J = J k 
for all i = 1 : n do 

for all t = 1 : T m do 

t(o) _ t 
?t,< - so 

for all / = 1 : J do 

Gibbs sampler II l; . t : 
for all j = 1 : 2fc g do 

Metropolis-Hastings procedure: 

6~?(6|r 3 ' ) *,»7) 
Compute r, (£ J , 6; £ _J ; , r], t) = 
u~W[0,l] 

if u < rj(£i jb;^ - ! ,r),t) then 

^ = 6 
end if 

end for 

t(0 _ <r 
?t,i — 5 

end for 



3(wl£,*i»7) 



pj(t\£,i,yt,v) oc ( 



Z=l 



g(j/i.Cii,t|»7)J 



end for 
end for 



fe=iX)pj(*lfi,W.»7)*t and /3 ft+1 ~Il£ Tfe+1 (/3 



s fc = s fc _i + A fe (ft(sfe_x) + e fe + rfc) , 



(4.1) 



where (efc)fc>i and (r"fc)fc>i are random processes defined on the same probability space taking 
their values in an open subset S of R ns ; ft, is referred to as the mean field of the algorithm; (r k ) k >i 
is a remainder term and (e k )k>i is a stochastic excitation. 
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To be able to get a convergence result, we consider the truncated sequence (sk) k defined as 
follow: let 5„c5 and s k = s fe _i + A fc /i(s fc _ 1 ) + A fc e fc + A fe r fc , where 



if s k G AC Kfc _ I 

if Sfe ^ x; refe _ 1 



Sfe = Sfc , 

sa.. = s k G 5 a , 

Kfe = Kfe_l + 1 . 



The projection s k can be made through different ways (cf. [7]). 

We will use Delyon's theorem which gives sufficient conditions for the sequence (s k )k>o trun- 
cated on random boundaries to converge with probability one. The theorem we state here is a 
generalisation of the theorem presented in [7]. Indeed, we have add the existence of an absorbing 
set for the stochastic approximation sequence. The proof of this theorem can be carried out the 
same way Delyon et al. do theirs adding the absorbing set. This is why it is not detailed here. 

Theorem 4.1. We consider the sequence (s k )k>o given by the truncated procedure (4.2). 
Assume that : 

(SAO') There exists a closed convex set S a C S such that for all k > 0, G S a w.p.l. 

oc 

(SA1) (Afe)fe>i is a decreasing sequence of positive numbers such that ^2 Afe = oo. 

fe=i 

(SA2) The vector field h is continuous on S ou S a and there exists a continuously differentiable 
function w : S — > M such that 

(i) for all s G S, F(s) = (d s w(s), h(s)) < 0. 

(ii) inb{w{£)) = 0, where C = {s G S : F(s) = 0}. 

(STAB1') There exist a continuous differentiable function W : M. N — > R and a compact set K, 
such that 

(i) For all c > 0, we have W c niS a is a compact subset of S where W c = {s G S : W(s) < 

c} is a level set. 

(ii) (d s W{s), h(s)) < 0, for alls€S\tC. 

p 

(STAB2) For any positive integer M, w.p.l lim V] A/~ei~~&w(s l_,)<ih exists and is finite and 
w.p.l lim sup |rj : |lv^( s , e _ 1 )<M = 0. 

k — >oo 

Then, w.p.l, limsupd(sk,£') = 0. 

k — >oc 

Assumption (SA2), which involves a Lyapunov function w, replaces the usual condition that, 
w.p.l, the sequence comes back infinitely often in a compact set which is not easy to check in prac- 
tice. In addition, assumptions (STAB1') and (STAB2) give a recurrent condition introducing a 
Lyapunov function W which controls the excursions outside the compact sets. The two Lyapunov 
functions w and W do not need to be the same. Another interesting point is that the truncation 
does not change the mean field and therefore the stationary points of the sequence. 



This theorem does not ensure the convergence of the sequence to a maximum of the likelihood 
but to one of its critical points. To ensure that the critical point reached is a maximum, we would 
have to satisfy two other conditions (called (LOC1-2) in [7]) which are typical conditions. That 
is to say, it requires that the critical points are isolated and for every stationary points s* G £' 
the Hessian matrix of the observed log-likelihood is negative definite. 

We now want to apply this theorem to prove the convergence of our "SAEM like" procedure 
where the missing variables are not simulated through the posterior density function but by a 
kernel which can be as close as wanted -increasing J k - to this posterior law (generalising Theorem 
3 in [7]). 

Let us consider the following stochastic approximation: (f3 k ,Tk) arc simulated by the transi- 
tion kernel described in the previous section and 



Sk = Sfe-l + &k(S(0 k ,Tk) - Sfc-l) , 
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which can be connected to the Robbins-Monro procedure using the notations introduced in [7]: 
let T = (J-k)k>i be the filtration where T k is the a— algebra generated by the random variables 
{So, Pi, ■ ■ ■ ,P k , Ti, . . . ,Tfe), is the expectation with respect to the posterior distribution tt^ 
and 



/i(sfc_i) = E^,^ ^ [S{(3, t)} - s fe _! , 

e k = S(p k ,T k )--E[S(p k ,T k )\? k - 1 ] , 

r k = E[S(p k ,T k )\F k ^}-E T . lBk _ i) [509, r)] . 

Theorem 4.2. Let w(s) = -l(fj(s)) where l(rj) = logJ2 J RN q(y, (3, T, rj)d(3 and h(s) = 

T 

X) Jr« (5(/3, r) — s)7Tjj( s )(/3, r)d/3 /or se5. Assume that: 

T 

(Al) £/ie sequences (A k )k>i and (J k )k>i satisfy : 

oo oo 

(i) (Afc)fc>i zs non-increasing, positive, ^ A& = oo and ^ A| < oo . 

fc=i fe=i 

(ii) (Jfe)fe>i tafces its values in the set of positive integers and lim Jk = oo . 

fc — >oc 

(A2) £' = {s £ 5, (d s w(s), h(s)) = 0} «s included in a level set of w. 

Let (sk)k>o be the truncated sequence defined in equation (4-2), K a compact set of R N and 
/Co C S(M. N ) a compact subset of S. Then, for all f3 E K, r £ T and s G JC , we have 

lim d(s k ,C) = Pfl n ,To,so,o -os. , 

where P/3 0iXO)S o,o * s ^ e probability measure associated with the chain (Z k = (/3 k , T k , s k , K k )) k >o 
starting at (J3 , Tq, Sq, 0). 

The first assumption which concerns the two sequences involved in the algorithm, is not 
restrictive at all since these sequences can be chosen arbitrarily. 

The second assumption, however, is more complex. This is required to satisfy the assumptions 
of Theorem 4.1. This is a condition we have not proved yet and is part of our future work. 

Proof. The proof of this theorem is given in Section 6. We give here a quick sketch to emphasise 
the main difficulties and differences between our proof and the convergence proof of the SAEM 
algorithm given in [7]. 

Even if the only algorithmic difference between our algorithm and the SAEM algorithm is the 
simulation of the missing data which is not done with respect to the posterior law q(f3, r|y, r}) 
but through an approximation which can be arbitrarily close, this yields a very different proof. 
Indeed, whereas for the SAEM algorithm, the stochastic approximation leads to a Robbins-Monro 
type equation (4.1) with no residual term r k , our method induces one. The first difficulty is 
therefore to prove that this residual term tends to while the number of iterations k tends to 
infinity. Our proof is decomposed into two part, the first one concerning the deformation variable 
(3 and the second one the label r. The first term requires to prove the geometric ergodicity of 
the Markov chain in f3 generated through our kernel. For this purpose, we prove some typical 
sufficient conditions which include the existence of a small set for the transition kernel and a drift 
condition. Then, we use for the second term some concentration inequalities for non stationary 
Markov chains to prove that the kernel associated with the label distribution converges toward 
the posterior distribution q{T\y,rf). 

The second difficulty is to prove the convergence of the excitation term e^. This can be carried 
out as in [7] using the properties of our Markov chain and some martingale limits properties. □ 

Corollary 1. Under the assumptions of Theorem we have for all f3 £ K, ro G T, sq g 
S a and 7/0 G QVg, 

lim d(r) k ,C) = f/3 ,-r . So .o-a.s , 

k — *-oo 
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where P/3 jTo ,so,o, * s the probability measure associated with the chain (Z^ = (/3 fc , t%, s/j, Kk))k>o 
starting at (J3 , r , s , 0) and £ = {Pr? e f/(S),P§^(n) = 0}. 

Proof. This is a direct consequence of the smoothness of the function s i— ► fj(s) on 5 and of 
Lemma 2 of [7] which links the sets C and £ . □ 

5. Experiments. 

5.1. USPS database. To illustrate the previous algorithm for the deformable template 
model, we are considering handwritten digit images. For each digit, referred as class later, we 
learn two templates, the corresponding noise variances and the geometric covariance matrices. We 
use the USPS database which contains a training set of around 7000 images. Each picture is a 
(16P16) gray level image with intensity in [0, 2] where corresponds to the black background. In 
Figure 5.1 we show some of the training images used for the statistical estimation. 

Niimiiimimiiiiiiiiiiiiiimiiiiii 

mmmmwmiwummmmim 
tfmmiJHmMiitwitwimwmi 

Fig. 5.1. Some examples of the training set: 40 images per class (inverse video). 

5.1.1. General setting of the algorithm. A natural choice for the prior laws on a and 
r g is to set for the mean on a and to induce the two covariance matrices by the metric of the 
spaces V p and V g involving the correlation between the landmarks through the kernels: define the 
square matrices M p (k,k') = K p (v p j,v p j') VI < k, k' < k p , and M g (k,k') = K g (v g j,v g j') VI < 
k, k' < k g . Then E p = M" 1 and E s = A/^ 1 . In our experiments, we have chosen Gaussian kernels 
for both K p and K g , where the standard deviations are fixed: cr p = 0.2 and a g = 0.12 (for an 
estimation on [—1.5, 1.5] 2 and [—1, 1] respectively). 

For the stochastic approximation step-size, we allow a heating period k^ which corresponds to 
the absence of memory for the first iterations. This allows the Markov chain to reach an area of 
interest in the posterior probability density function q(j3,T\y,r)) before exploring this particular 
region. In the experiments presented, the heating time kh lasts up to 150 iterations and the 
whole algorithm is stopped at, at most, 200 iterations depending on the data set (noisy or not). 
This number of iterations corresponds to a point when the convergence is reached. We choose, 
as suggested in [7] the step-size sequence as = 1 for all 1 < k < kh and = (k — kh)~ - 6 
otherwise. 

The multicomponent case has to face the problem of its computational time. Indeed, as we 
have to approximate the posterior density by running elements of r m independent Markov 
chains, the computation time increases linearly with J^. In our experiments, we have chosen a 
fixed J for every EM iteration, J = 50. This is enough to get a good approximation thanks 
to the coupling between the iterations of the SAEM algorithm and the iterations of the Markov 
chains. Indeed, even if the parameter rj is modified along the SAEM iterations, its successive 
jumps are small enough to ensure the convergence of the MCMC method. Intuitively speaking, it 
is equivalent to consider not only 50 iterations of the MCMC method but 50 times the number of 
SAEM iterations. 
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5.1.2. The estimated templates. We are showing here the results of the statistical learning 
algorithm for our generative model. To avoid the problems shown in [2], we choose the same 
initialisation of the template parameter a as they did, that is to say, we set the initial value of a 
such that the corresponding I a is the mean of the gray-level training images. 

rnwm WMM 

Fig. 5.2. Estimated prototypes of the two components model for each digit (40 images per class; 100 iterations; 
two components per class). 

mm?, ?mm 

MOT] IBi 

Fig. 5.3. Estimated prototypes of the two components model for each digit (40 images per class, second random 
sample). 

In Figure 5.2, we show the two estimated templates obtained by the multicomponent procedure 
with 40 training examples per class. It appears that, as for the mode approximation algorithm 
which results are presented on this database in [1], the two components reached are meaningful, 
such as the 2 with and without loop or American and European 7. They even look alike. 

In Figure 5.3, we show a second run of the algorithm with a different database, the training 
images are randomly selected in the whole USPS training set. We can see that there are some 
variability, in particular for digit 7 where there were no European 7 in this training set. This 
generates two different clusters still relevant for this digit. The other digits are quite stable, in 
particular the strongly constrained ones (like 3, 5, 8 or 9). 

5.1.3. The photometric noise variance. Even if we prove the convergence result for fixed 
component noise variances, we still try to learn them in the experiments. The same behaviour 
for our stochastic EM as for the mode approximation EM algorithm done in [1] is observed for 
the noise variances. Indeed, allowing the decomposition of the class into components enables the 
model to better fit the data yielding a lower residual noise. In addition, the stochastic algorithm 
enables to look around the whole posterior distribution and not only focusing on its mode which 
increases the accuracy of the geometric covariance and the template estimation. This yields lower 
noise required to explain the gap between the model and the truth. The evolution of the estimated 
variances for the two components of each digits are presented in Figure 5.4. 

The convergence of this variance for some very constrained digits like digit 1 is faster. This 
is due to the well defined templates and geometric variability in the class which can be easily 
captured. Therefore, a very low level of noise is required very quickly. On the other hand, some 
very variable digits like digit 2 are slower to converge. The huge geometric variability adding 
to very complex shapes for the templates lead to a more difficult estimation and therefore more 
iterations before convergence. 

Last point that can be noticed is the convergence of the European 7 which looks slower than 
the other component (American 7) . The reason of this behaviour is that there are only two images 
of such a 7 in the training set and it takes a longer time for the algorithm to put together and 
only together these two shapes so that the clustering is better with respect to the likelihood. The 
other 7 does not suffer from this problem and converges faster. 
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Fig. 5.4. Evolution of the two cluster variances along the iterations. 



5.1.4. The estimated geometric distribution. To be able to compare the learnt geome- 
try, we draw some synthetic examples using the mixture model with the learnt parameters. Even 
when the templates look similar, the separation between two components can be justified by the 
different geometry distributions. To show the effects of the geometry on the components, we have 
drawn some "2" with their respective parameters in the four top rows of Figure 5.5. 



Fig. 5.5. Some synthetic examples of the components of digit 2: First four rows: templates of the two 
components deformed through some deformation field (3 and —f3 drawn from their respective geometric covariance. 
Two last row: template of the first component from Figure 5.2 with deformations drawn with respect to the second 
component covariance matrix. 



For each component, we have drawn the deformation given by the variable (3 and its opposite 
—/3 since, as soon as one is learnt, because of the symmetry of the centred Gaussian distribution, 
the opposite deformation is learnt at the same time. This is why sometimes, one of the two looks 
strange whereas the other looks like some element of the training set. 

The simulation is done using a common standard Gaussian distribution which is then multi- 
plied by a square root of the covariance matrix we want to apply. We can sec the effects of the 
covariance matrix on both templates and the large variability learnt. This has to be compared 
with the bottom rows of Figure 5.5, where the two samples are drawn on the one template but with 
the covariance matrix of the other one. Even if these six lines represent some "2"s, the bottom 
ones suffer from the geometrical tendency of the other cluster and do not look as natural. This 
shows the variability of the models into classes. 

5.2. Medical images. We also test the algorithm on a database which consists in 47 2D 
medical images. Each of them represents the splenium (back of the corpus calosum) and a part 
of the cerebellum. Some of the training images are shown in Figure 5.6 first row. 
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(a) (b) (c) (d) 



(e) 



Fig. 5.6. First row : Ten images of the training set representing the splenium and a part of the cerebellum. 
Second row : Results from the template estimation, (a) : gray level mean image of the ^7 images. Templates 
estimated (b) : with the FAM (c) : with the stochastic algorithm on the simple model (d,e) : on the two component 
model. 



The results of the estimation are presented in Figure 5.6, second row. The first picture 
presented, (a), is the gray level mean of the 47 images. The second one, (b), shows the estimated 
template computed with the Fast Approximation with Mode Algorithm presented in [1] for a 
single component model. This algorithm is an EM-like algorithm where the E step is simplified. 
The posterior distribution of the hidden variable is approximated by a Dirac distribution on its 
mode. This yields a deterministic algorithm, quite simple to implement but with no theoretical 
convergence properties. It shows a well contrasted splenium whereas the cerebellum remains a 
little bit blurry (note that it is still much better that the simple mean (a)). 

This picture has to be compared with picture (c) which gives the estimated template computed 
with our algorithm with r m = 1. The great improvement from the gray level mean of the images 
(a) or the FAM estimation (b) to our estimations is obvious. In particular, the splenium is still 
very contrasted, better localised and the cerebellum is reconstructed with several branches. The 
background presents several structures whereas the other estimates are blurry. The two anatomical 
shapes are relevant representants of the ones observed in the training set. 

The estimation has been done while enabling the decomposition of the database into two 
components with our SAEM-MCMC algorithm presented here. The two estimated templates are 
shown in Figure 5.6 (d) and (e). The differences can be seen in particular on the shape of the 
splenium where the boundaries are more or less curved. The thickness of the splenium varies as 
well between the two estimates. The position of the fornix is also different, being closer to the 
boundary of the image. The number of branches in the two cerebella also tends to be different 
from one template to the other (4 in the first component and 5 in the second one). 

The estimation suffers from the small number of images we have. This can be seen in the 
estimation of the background which is blurry in both images. To be able to explain the huge 
variability of the two anatomical shapes, more components would be interesting but at the same 
time more images required so that the components will not end up empty 

6. Proof of Theorem 4.2. We recall that in this section the variances of the components 
are fixed. This reduces the parameters 9t to (at,T g j) for all 1 < t < r TO . 

First let exhibit sufficient statistics for the model. The complete log-likelihood equals: 

( 1 \ |A|/2 / 1 



i=l 



log q(y,/3,T\r)) = ^ |log 
log 



+ l0g(Pr ! ) \ , 



1 \ k " 



where Kga = zpl a and ||.|| denotes the Euclidean norm. This emphasises five sufficient statistics 
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given in their matricial form for all 1 < t < r„ 





= E lr, 




l<i<n 


Sl,tiP,T) 


= E lr, 




Ki<n 




= E lr, 




Ki<n 


S 3 , t (P,T) 


= E lr s 




Ki<n 


S4,i(/3,r) 


= E In 



Thus we apply the stochastic approximation at iteration k of the algorithm leading to: 

Sk,m,t = Sk-l, m ,t + Ak(S mit (f3 k ,Tk) — Sk-l,m,t) 

for < m < 4 and rewrite the maximisation step. The weights and the covariance matrix are 
updated as follows: 



Pr.k 

r g,r,k 



Sfe,0, 



n + r m a p 



(Sk,0,rSk,3,T + a g^g) 



The photometric parameters are solution of the following system: 

- T |A | +a { s k,o,T {skA.r + {uT.kY Sk,2, T ot Tl k - 2(a T ,fc) t s/ £ .i,r) + a p a^) 



(6.1) 
(6.2) 



(6.3) 



'r,k 



which can be solved iteratively for each component r starting with the previous values. In this 
part, all a^. k are fixed and this leads to an explicit form for the parameters a T ,k- 

We will now apply Theorem 4.1 to prove Theorem 4.2. 
(SAO') is satisfied with the set S a defined by 

S a 4 { S G S I < S ,t < n, ||5 M || < Hvll, ||5 a ,t|| < n, < S 3 . t , < S 4 . t < \\y\\ 2 , VI < t < r m } . 

Thanks to the convexity of this set, the new value Sk defined as a barycenter remains in iS a . 
Assumption (SA1) is trivially satisfied since we can choose our step-size sequence (Ak)k- 

(SA2) holds as already proved in [2] for the one component case with w(s) = —l(fj(s)) such as 
(STABl'(ii)) with the same function W(s) = —l(i)(s)). These conditions imply the contraction 
property of the Lyapunov function w and the convergence of the stochastic approximation under 
some conditions on the perturbations. 

We need to suppose, like in the one component case [2], that the critical points of our model 
are in a compact subset of iS which stands for (STABl'(i)). This is an assumption which has to 
be considered in a future work. 

We will now focus on (STAB2) which is the assumption which gives the control of the per- 
turbations required for the convergence. 

We first show the convergence to zero of the remainder term |rfc|lv^( Sfc _ 1 )<M for any positive 
integer M. We denote by nk = Kf)( Sk ) f° r an Y k > 0. We have = K[S(j3 k ,Tk)\J r k-i] — 
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t=l 1=1 



E 7Tk _ 1 [S((3,t)] thus, 

r- n „ T m J k 

r k = £ / sog,T)n^_ 1)T 09 OJ ^)IJ / p Jk (rMi,Vi,vk-i)l J^A^,®^? 

C- 1 ) tOh Jt(0 

•R./ fc ( T l2/, = E[ / PJ h (n\€i,yi,Vk-i) <2(&) d£i- We can now rewrite 



We denote by Q(&)dfc = ft ft n„,t(£ t V ; > and by 

t=i i=i 



i=l 



W < 



^ / 5(^,r)[n^_ 1 ,.(/3o^)^(T|l/ ) %- 1 )^-7r % _ 1 (/3,- 



d/3 



d/3 



\Rj h (i-\y,r)k-i)\ 



E 



\Rj k {r\y,r) k -i) - g(r|y, r? fe _i)| . 



Denoting A4 7)t ,_ 1 = max,- J RN \S(/3,T)\q(j3\T,y,rjk-i)dl3, we obtain finally 



\rk\t W ( Sk ^)<M < E 



S(/3,t) Il^_ uT (f3 ,f3)-q(f3\T,y,ri k ^) dp 



+M Vh - 1 \ R Jk( T \y,Vk-i) - q(r\y,r] k -i)\ t w{Sk l) < M 



w(s k -i)<M (6-4) 
(6.5) 



We will first show that the Gibbs sampler kernel H n , T satisfies a lower bound condition and a Drift 
condition (MDRI) to get its geometric ergodicity (as it has been done in [2]). 
(MDRI) For any s G <S and any t G T, II^( s ) iT . is irreducible and aperiodic. In addition there 
exists a function V : M. N — > [1, oo[ such that for any p > 1 and any compact subset JC C S, 
there exist a set C, an integer m, constants 0</t<l,S>0,5>0 and a probability 
measure v such that 



^ hrf_ r II^ )iX (/3, A) > 5v(A) V/3 e C, VA e S(K ) 
sup n™ (s) T V*(/3) < K V*(/3) + Blc(/3) . 

s<EK.,t<£T 



(6.6) 
(6.7) 



Notation 1. Let (ej)i<j<jv ^ e ^ e canonical basis of the (3-space and for any 1 < j < iV, /ei 
E V! tj = { /3 G I 1 *' | ((3,ej), h T = 0} &e t/ie orthogonal of Span{ej} and p n ,T.j oe the orthogonal 
projection on E^tj i.e. 



-"J \\r),T 



where (/3,/3% )T = E7=iPK,lA for and 



1) ~v 



(i.e. the natural dot product associated 



with the covariance matrices (T g j)t) and ||.||t),r is the corresponding norm. 

We denote for any 1 < j ' < N, n G 0Pp and t G T, fey 11^. T j t/ie Markov kernel on 
M. N associated with the j-th Metropolis-Hastings step of the Gibbs sampler on M. N . We have 
H, h T = Ht),t,n o ■■■ o Tl V! T t i. 

Inequality (6.6) is equivalent to the existence of a small set C for the kernel II^( s ) jT . independent 
of s G JC. We recall here the definition of a small set: 
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Definition 1. (cf. [17]) A set £ G B(R N ) is called a small set for the kernelll if there exist 
an integer m > and a non trivial measure v m on B(M. N ), such that for all f3 G £ , B G B(M. N ), 
U m (f3,B)>u m (B). 

When this holds, we say that £ is v m - small. 

We now prove the following lemma: 

Lemma 6.1. Let £ be a compact subset ofM. and K, be a compact subset of S, then £ is a 
small set ofR N for (Uf,( s ) :T ) s &ic,TeT ■ 

Proof. The transition probability kernel of our Markov chain on (3 is defined as follows : for 
coordinate j, the kernel is 

U 7hTJ (f3,dz) - (® m#J -V(<fc m ))P [ qj (dz j \/3- j ,r),T) rj (l3 j ,d Z i;l3- j , V ,t)+ 

5 p] {dzi) [ (1 - rji/y ,b;0- j ,v,T)) qj (b\/3- j ,t],r)db 



(6.8) 

Then note that there exists a c > such that for any r\ G QFg, any (3 G M. N and any b G R, the 
acceptance rate Tj{(3 J ,b; (3~ J , ?/, t) is uniformly lower bounded by a c so that for any 1 < j < N 
and any non-negative function /, 

n,,,T,j/(/3) > a c / f(f3^ J +bej)qj(b\f3^ J ,r,r])db = a c / f(p n ^j(f3) + ze J /\\e j \\ n , T )g ,i(z)dz, 

JR JR 

where go,i is the probability density function of the standard 7V(0, 1). By induction, we have 
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9oA z j) dz j , (6.9) 



where p n ,T,q,r = Pn,T,r ° Pr),T,r-i o ■■■ op v ^ q for any integer q < r and p v ,t,n+i,n = Id R N. Let 
^4jj,t € £(K ) be the linear mapping on = (zi, • • • , zjv) defined by 

N 

Ay.-r^f = ^ z jPri,T,j+l,N{ej)/\\ej\\ri,T ■ 

One easily checks that for any 1 < k < N, Span{ Pri,rj+i,N{^j)> k < j < N} = Spanjej, k < j < 
N} so that A v ^t is an invertible mapping. By a change of variable, we get 



/ f (pt,,t,i,n(P) + A n ,TZi) T\go,i{zj)dzj = / f(u)g p nW , a A t (u)du. 

JR N ■ , JR N 



N 

g Qt i{zj)dZj = 
j= i Jm. n 

where g^T. stands for the probability density function of the normal law jV(/z, E). 

Since (77, t) — > A^ jT - is smooth on the set of invertible mappings in (77, t), we deduce that 
there exist c^; > and Cx; > such that c/cld < A^^A^^ < Id/c^e and g Pri T t n (/3) ! a 7J t a* t (u) > 
Ck.9p v t j jv(/3),w/c( u ) uniformly for ?y = fj(s) with s G K, and r G T. Assuming that (3 £ £, since 
7] — > Pr/,T,i,N is smooth and £ is compact, we have sup^g^ r)= ^( s ). se/c.-reT ||Pr;,-r,i,iv(/3)|| < 00 so 
that there exist other constants Cjc > and > such that for any (u, /3) G R^Pf and any 
77 = r)(s), s G /C, r G T 

fl r p^,-r,l,w09).^,-r^,-r('") - C IC90,Id/cA U )- ( 6 ' 10 ) 

Using (6.9) and (6.10), we deduce that for any A H n ^(f3, ^4) > C^a^ , with v equal to the 
density of the normal law Af(0, Id/c/c). This yields the existence of the small set as well as equation 
(6.6). □ 
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This property also implies the 0-irreducibility of the Markov Chain generated by TL„ T . More- 
over, the existence of a z^-small set implies the apcriodicity of the chain (cf. [17]). 

Now consider the Drift condition (6.7). 

We set V : K w — > [l,+oo[ as the following function V((3) = 1 + |/3|| 2 , where || • || denotes the 
Euclidean norm. Define for any g : R w — > R" s the norm \\g\\v = sup ^yPJ) and the functional 

space Cy = {g ■ — > IR" 3 | ll^ljy < +oo}. For any r\ £ 0Po and any r £ T, we introduce a 
(77, t) dependent function V5j iT (/3) = 1 + ||/3||^ T . 

Lemma 6.2. Let K be a compact subset o/OPg. For any integer p > 1, £/iere exzsi < p < 1 
and C > swc/i i/iai /or any r\ £ K, any r £ T, any /3 £ we /lave 

n^K^(/3)<p^(/3) + c. 



Proof. The proposal distribution for n„ XJ ' is given by qj(f3-* \ (3 3 , r, y,r/) ^ p„ x ,-(/3) + 
ej , where z ~ A/"(0, 1). Then, for any f3 £ M. N and any measurable set A £ B(M. N ), there 



exists a v . T j (/3) uniformly bounded from below by a c > such that 



n, ; ,-rj(/3,-4) = (1 - S,T,j(/3))lA(/3) + ar,,-rj(/ 3 ) / t A \ Pv,T,j(fl) + m — n — e i ) 9os(dz) , 



Since (p,, T ,j(/3), ej->,, T = 0, we get V^ )X (pri,T,j(0) + |Ji"f^7 e j) = ^.tGVtjO^)) + 
deduce that there exists C such that for any /3 G R w : 



z 2 . We 



IL ; , XJ V^(/3) = (1 - a„, xJ (/3))V^(/3) + a i; ,xj(/3) / (^(j^,,^)) + z 2 ) p 5o ,i(^) 
<(l-a t ,,^(/3))y^(/3)+a l; , XJ (/3)l 

^ T (p, )T , i (/3)) + V^- 1 (p J7 , Tj (/3)) / (l + z 2 )Vi(^) 

< (1 - a, )T>i G9))V^ T 0g) + a„ )T>3 -G9)^: T (p, >T ^C9)) + cvfc^M) ■ 

We have used in the last inequality the fact that a Gaussian variable has bounded moment of any 
order. Since a ri . 1 - - j(f3) > a c and Hp^.t-.j (/3)||tj,t- < ||/3|U,t (Pn,T,j is an orthonormal projection for 
the dot product (■, -)^,t)) we get that for any e > 0, there exists Cjf i£ such that for any (3 £ M. N 
and r) £ K,t £ T 

^jVP T (/3) < (1 - a c )V* T (/3) + (ac + e)^ T (p„, TJ -G9)) + C*, e . 

By induction, we get 

N 

n,,^(/3)< ]T n( 1 - a c ) 1 ^ J ( a c + e )" J ^(p^-«(/ 3 )) + ^ £ (( 1 + e ) Ar+1 - 1 ) . 

ti£{0,l}"j=l 

where p n ,T,u = ((1 - u/v)Id + unP v ,t,n) o • ■ • o ((1 - m)Id + uiP v ,t,i). 

Let p, h T = p v ,-r,N • • • °Pr),T,i and note that p v ,t,u is contracting so that 



IV^ T (/3) < &c, £ ^ T (/3) + (oc + e) Ar ^(p, ; ,.(/3)) + ^ ((1 + e) W+1 ) 

for b C)£ = (j2ue{oi} N u=£i IljLi(l — a c) 1_Uj ( fl c + e) Uj ^j . To end the proof, we need to check 
that p, h T is strictly contracting uniformly on K. Indeed, ||p^ iT (/3)||^ iT = ||/3||r;,T implies that 
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Pri,T,j{0) = P for any 1 < j > < N so that (/3,ej) VjT = and /3 = since (ej)i<j<N is a basis. 
Using the continuity of the norm of p„ T and the compactness of K, we deduce that there exists 
< pk < 1 such that ||p?7,t(/3)||?7,t < /Ox||/3||j7,t- for any /3 € 1 N , n £ K and any r G T. Changing 
for 1 > > pk we get (l + /3^-||/9||^. 1 -) p < + \\P\\r h -r) p + Ck for some uniform constant 

Ck so that 

U n , T V p T (P) < & c ,e^ x (/3) + P'^(a c + e) N V* r (j3) + C K ,e- 
Since we have inf |fe c ,e + p'^k (a c + < 1 the result is straightforward. □ 

Lemma 6.3. For any compact set K C &Pg, any integer p > ; i/iere eirisi < p < 1, C > 
and a positive integer mo such that Vm > mo , Vr? G K , V/3 G T 

n£ T v»G9)<pV*09) + c. 



Proof. Indeed, there exist < c\ < c 2 such that ciV((3) < V VtT (f3) < C2V(0) for any 
(P,T),t) G R w PA'PT. Then, using the previous lemma, we have IL™ T VP(P) < <^ P H% T VV T (P) < 
C - p (p m VP T (f3) + C/(l - p)) < (c 2 /ci)P(p m V p (f3) + C/(l - p)). Choosing m large enough for 
{c-ijcif ' p m < 1 gives the result. □ 

This finishes the proof of (6.7) and in the same time the (MDRI). 

Thanks to this property we can use the following proposition (cf. [17], [5] Proposition Bl) and 
xia applied to 
and all 1 < i < n. 



lemma applied to every sequence with stationary distribution q(-\yi, t, rj) for all 1 < t < r, 



Proposition 1. Suppose that H is irreducible and aperiodic and that U m (f3 , .) > lc(P )Sv(.) 
for a set C G B(M. N ), some integer m and 8 > and that there is a Drift condition to C in the 
sense that, for some < K < 1, B > and a function V : — > [1, +oo[, 

IIV(/3 ) < kV(P ) V/3 i C and sup (V(f3 ) + UV(J3 )) < B . 

p„ec 

Then, there exist constants K and < p < 1, depending only upon m,S,K,B, such that, for all 
P G R N , and all g G C v 

\\IL n g(P )-7r(g)\\ v <Kp n \\g\\ v . 



Lemma 6.4. Assume that there exist an integer m and constants < k < 1 and q > and a 
set C such that 



K m V(P ) < kV(P q ) V/3 i Cand IIF(/3 ) < qV(P ) V/3 G 



niV 



for some function V : R — > [1, +oo[. Then there exists a function V and constants < p < 
1, c > and C > 0, depending only upon m, K, q, such that, 

nV"(/3 ) < pV(P ) yp i CandcV <V < CV . 

Proof. Define 

Til 
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For f3 £ C, we have 



Therefore we obtain 



m— 1 



ny(/3 ) < ^- j/m ^ j v(i3 ) + kV((3 ) 



K l-l/my < y < / ^ K i_ 3 y m ^_i j y 

This ends the proof of Lemma 6.4. □ 

Thus, applying the Proposition 1 and Lemma 6.4 to the Drift conditions of Lemmas 6.2 and 
6.3, we get that each Gibbs sampler kernel IL, T is geometrically ergodic. 

Let us now go back to the convergence of the first part of the residual term (6.4) towards 0. 

We use the term ^w(s k -i)<M to show that the parameters rjk-i are constrained to move in 
a compact set of QPg. We show first that the observed log-likelihood I tends to minus infinity as 
the parameters tend to the boundary of QPg. Equation (2.1) implies that for any 9 £ we have: 

g^lft^a^MAIIVJ < (2 7 ra 2 )-l A l/ 2 (27r)- fc S |L g , Tt |- 1 / 2 exp (-^fir^fr 

so that denoting C as a constant: 



log(g(y,7y)) < ^ 



— 2" g.-ri'^s/F H 2 — lo 8l i 9,^l _ 2^2 2 l0g ^ <7 ^> 

l+a„ 



c. 



It was shown in [11 that we have lim — 4f(F , E )p H ^ log IF 1 | = —00 

||r||+||r-i||—oo 2 y 2011 

and lim — h(a — /i p )'S~ 1 (a — fj, p ) = —00. Moreover, we have limp_»o log(p) = —00 , so we get 

||a|| — >oc 

lim J) _ v g(ei» e ) log q(y, 77) = — 00 , which ensures that for all M > there exists I > such that 
IMI > * or ||r t || + \\T^\\ > I or p t < \ implies -1(77) > M. 

So W(sfc-i) < M implies that for all 1 < t < r m we have ||a t || < I, \\T t \\ + \\T 1 T 1 \\ < I and 
j < Pt < 1 ~ j because J2t=i Pt = 1 - 

Let us denote by Ve = 6j m P {(pt)i<t<r m € [7, 1 — j\ m \ Ym"=\ Pt = l} , where 



(a,r„) I aeR k », r s GSym+ ( 



HI<Aj<l|r ff ||<^ 



So there exists a compact set Ve of 0Pp such that W(sk-i) < M implies ?)(sfc-i) G and the 
first term (6.4) can be bounded as follows: 



d(3 



W{s k - l )<M 



£ / %r)k l!T ^/3)-?(|3|T,!/,%-i) 

<Vsup / 5(/3,r)[n^(/3 ,/3)-9(/3k,y,77)]d/3 

Since for each r the function (3 — > 5(/3, r) belongs to £y, since we have proved that each transi- 
tion kernel H n ,T is geometrically ergodic and since the set Ve is compact, we can deduce that the 
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first term (6.4) converges to zero as Jk tends to infinity. 



We now consider the second term (6.5). We first need to prove that M Vk ^w(s k ^ 1 )<M is 
uniformly bounded that is to say the integral of the sufficient statistics arc uniformly bounded on 
{W(sk-i) < M} ; we only need to focus on the sufficient statistic which is not bounded itself: let 
(j,m) € {l,...,2fc s } 2 : 

/ \/3 j /3 m \q(/3\T,y,n k -i)df31 m _ ieVl < [ \P j P m \ ^ ^ ^ d/31 m _^ Vt 



< 



q( 



^/,^,ex P (-^r- Ul /3)^ 



1, 



<c(v e ) J Qog,r 9lT , fc _i)fflcp \--\\f3\\')d(3<oo, 

where C(Ve) is a constant depending only on the set Ve, F g . T is the diagonal block matrix with 
all the T g ^ Ti given by the label vector t and we have changed the variable in the last inequality 
and Q is a quadratic form in (3 whose coefficients are continuous functions of elements of the 
matrix T g . So we obtain that for all M > there exists £ > such that for all integer k we have: 
< C(V t ). 

We now prove the convergence to of the second term of the product involved in (6.5). Let 
us denote by TZ T y ^ for the term \Rj k (t|i/, %-i) — <z(t|?/, %-i)|- Thus we have: 



n 
i=l 



n „ n 
II / PJk( T i\£i>yi>Vk-l)Q{€i)d£i - Y[<l(Ti\yi,r)k-l) 



pj k ( T i\£i,yi,r)k-i)Q(€i)d4i - q(n\yi,vk-i) 

- E / \pjh( T Mi,yi,Vk-i) - q(n\yi,r]k-i)\ Q(b)d£-, 
i=i J 

f s Jk( r i,yi\& i ,i,'nk-i) q(n,yi\rik-i) 



E s Sjk ( s > Vi\€s,i,rik-i) q{yi\m-i) 



where we denote by Sj(t, yi\£t,i, v) the quantity I j ^ 

V i=i 

We write each term of this sum as follows: 

Sj k (n,y l \^ Ti ,i,Vk-i) q(n,yi\r)k-i) _ 



Q(vi, il'.lMv) 



E Sj k (s,yi\^ S! i,ri k . 

s=l 



q(vi\vk-i) 



Sj k (n,yi\^ n ,hVk-i)(q(yi\vk-i)- E Sj fc (s,sfe|&,t,»7k-i)) 

s=l 

q{yi\m-i) E Sjk( s ,yMs,i,rik-i) 



(S , j i ,(r j ,j/ i |^ Tiii ,%_i) -q(n, yi\r} k -i)) E s J k ( s ; Vi\£s,i, Vk-i) 

s=l 



q(yi\vk-i) E #./ fc (s,yi|£ s ,;,%-i) 



Denoting by 7^ the set of r m + i integers {!,•■• , T m } U we obtain finally: 



K 



■r,y.k 



< 



^"TT" — \Y1 I \ s J k ( s >yMs,i,Vk-i) - q( 
t! q(y* \vk-i )f£j 
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Defining the event A kti)t = {\Sj k (t,yi\tt,i,r)k-i) - Q(i>Vi\Vk-i)\ > Ck} for some positive sequence 
(Cfc)fc, we g et: 



T^T,v,k < E 



E 



E 



A Mtl 



E 



|£j fe (s,yi|£ s ,i,?7fc-i) - 3(s,y»|»7fc-i)| Q(&)d& 



\Sj k (s,yi\Ca,i,Tfk-i) - g(s,2/i|77fe_i)| Q(&)<2& . 



So we deduced that: 
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T,J/,fc 



< 



E 



Y)(sup5'j- k (s,3/ j |^ s i,r/ fe _i) + g(s,j/i|%_i))P04 A 
S 

(r m + l)Cfc 



< 



E 



Assuming Cfc < mini,t we obtain: 

P(Ag, iit ) = tfil^, - g(t, W |»7 fc -i)| < Cfe) 

1 1 



> P 

> P 



Sj k (t, yMt,i,Vk-i) q(t, Ui\r]k-i) 

1 1 



< 



Sj k (t, yi\£t,i,Vk-i) q(t, yi\vk-i) 



< 



q(t, Ui\Vk-i) (q(t, yi\r) k -i) + Ck) 

Ck 



2q{t,yi\vk-i 



where 



Using the first inequality of Theorem 2 of [9], we get: P(Ak t i,t) < c -i CX P (~ c 2 q ^ t ^ya j , 
c\ and C2 are independent of k since (77^) only moves in a compact set thanks to the condition 
%(« M <M). This yields: 



7<~r,y,fc < Ci > j— r h 1 

^ \ q(yi\Vk-x) 



(r m + 1) exp -c 2 



sup „ E 



max i q(2/ i |?7 fc _i) 4 
1 



(r m + l)Cfc. 



We have to prove that the Monte Carlo sum involved in Sj k (s, j/i|£ Si i, %-i) does not equal zero 
everywhere, so that sup^ s Sj,. (s, yj|£ s ,i) %-i) is finite. For this purpose, we can choose a particular 
probability density function /. Indeed, if we set / to be the prior density function on the simulated 
deformation fields £, we have for all r\ <E Vf. 



1 i 

1^ 



1 



q{yi\Z$,t,v)q(t\v) 



> 



i J 



i=i 



> (2w 2 )l A l 



where cr is the lower bound of the variances (cj). 
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We choose the sequence (Cfc)fc depending upon (Jk)k such that lim = and lim JkC 2 



k — >c 



— 1/3 

+00. We can take for example Cfc = for all k > 1. 



We will now prove the convergence of the sequence of excitation terms. 

n 

For any M > we define M n = ^k^k^-w(a k -i)<M an d let T = (Tk)k>\ be the filtration, 

fc=i 

where Tk is the c— algebra generated by the random variables (So, f3 1; . . . , /3 k ,T±, . . . , Tfc). We 

n 

have M n = J2 A fc ( s (Pki T k) ~ E [S(/3 fe , Tfc)| J^-i]) lw(s fc _i)<M so this shows us that (M„) is a 
fe=i 

^"-martingale. In addition to this we have: 

00 00 00 

[||M fc - A4-!!! 2 I H-i] = E E [ A *l|efcll ai w(- k -0<« I ^-1] < E A?E[|| e ,|| 2 I J- fc _!] 

fe=l fe=l fe=l 

< E A 2 fe E [||509 fcj T fc ) - E [5(^,^)1^!] || 2 I 

fe=l 
OO 

<EA 2 fc E[||S(/3 fe ,r fc )|| 2 \Tk-i] • 
fe=i 

We now evaluate this last integral term: 



3 



\S((3 k ,T k )\\ 2 I ^-l] =E / / l|5G9,T)||XLi,r09o^)n^^-i( T *'^,i.W)O(Cr i ,i)d£r i ,id^ 

< E / IW>r)||XU-r(A>,/W [/"i^U-rC&.Ode • 
7 Jk n J LV 

The last term equals one and again we only need to focus on the sufficient statistic which is not 
bounded itself. Indeed £3,4 (/3, r) for all 1 < t < r m so using the fact that the function V dominates 
this sufficient statistic, we obtain: 

E[\\S 3 , t (Pk,T k )\\ 2 I F k -i] <E / \\S3A^r)fn^_ i , T (f3 ,f3)df3 

~ Jr n 

< CE / V(j3) 2 lI%_ liT (j3 o ,p)d0 < Cj;^.,,^) 1 • 

Applying Lemma 6.3 for p = 2, we get: 

E [||S(/3 fc , r fc )f |^ fc -i] <Cj2( P V^o) 2 + C)<CT^(pV(f3 ) 2 + C). 

T 

OO OO 

Finally it remains: ^2 ^ [ll^fe — ^fc-i|| 2 I ^fc-i] < C E ^fc j which ensures that the previous 

fc=i fc=i 
series converges. This involves that (Mk)keN is a martingale bounded in L 2 so that lim Mk exists 

k — 'OO 

(see [13]). This proves the first part of (STAB2). 

To conclude this proof we apply Theorem 4.1 and get that lim d(sk,C) = 0. 

k — >oc 

7. Conclusion and discussion. Wc consider the setting of Baycsian non-rigid dcformable 
models building in the context of [1] and the associated MAP estimator. We approximate this 
estimator of this generative model parameters thanks to a stochastic algorithm which derives 
from an EM algorithm. We also prove its theoretical convergence toward a critical point of the 
observed likelihood. This is, to our best knowledge, the first convergent estimation algorithm 
of the templates and geometrical variabilities in the framework of mixture model for deformable 
templates. The algorithm is based on a stochastic approximation of the EM algorithm using a 
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MCMC approximation of the posterior distribution and truncation on random boundaries. We 
present experiments on the US-postal database as well as on some 2D medical data. This shows 
that the stochastic approach can be easily implemented with the algorithm detailed here and is 
robust to noisy situations, giving better result than the previous deterministic schemes. 
Many interesting questions remain open. 

The first goal of these model and algorithm is the estimation of some atlases in a given 
population and the acceptable deformations of these atlases that can explain the variability in the 
population. However, this model, as soon as the parameters are estimated, can be used to create a 
classifier. Given a new image, one can compute the most likely component that this image belongs 
to. This computation requires to evaluate the integral of the complete likelihood with respect to 
the posterior distribution as well as in the estimation process. A first proposition to overcome this 
difficulty has been given in [1] while approximating the posterior distribution by a Dirac on its 
mode. This gave very interesting results which are presented in that paper. However, in the case 
of noisy images, the same problem occurs and leads to bad classification ratios. Another way has 
been proposed in [3] using the same methods as in this paper, that is to say, using Monte Carlo 
Markov Chain methods. The results are impressive and the improvement noticeable. 

We have presented here a set of experiments on 2D images. The generative model as well as 
the algorithm and the proof of its convergence do not depend on the dimension of the images. 
The implementation for 3D images is only a numerical issue. We are currently working on the 3D 
codes to test this algorithm on real medical databases. 

An interesting extension would be to consider diffeomorphic mapping and not only displace- 
ment fields for the hidden deformation. This appears to be particularly interesting in the context 
of Computational Anatomy where a one to one correspondence between the template and the 
observation is usually needed and cannot be guaranteed with linear spline interpolation schemes. 
This extension could be done in principle using tangent models based on geodesic shooting in the 
spirit of [20]. Many numerical as well as theoretical work need to be done in this area. 
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